Abstract

The reconstruction of phylogenies from DNA or protein sequences is
a major task of computational evolutionary biology. Common phenomena,
notably variations in mutation rates across genomes and incongruences
between gene lineage histories, often make it necessary to model molecular
data as originating from a mixture of phylogenies. Such mixed models play
an increasingly important role in practice.

Using concentration of measure techniques, we show that mixtures of
large trees are typically identifiable. We also derive sequence-length
requirements for high-probability reconstruction.
1 Introduction



Phylogenetics [SS03, Fel04] is centered around the reconstruction of evolutionary 
histories from molecular data extracted from modern species. The assumption is 
that molecular data consists of aligned sequences and that each position in the 
sequences evolves independently according to a Markov model on a tree, where 
the key parameters are (see Section 3 for formal definitions): 

• Rate matrix. An $r \times r$ mutation rate matrix $Q$, where $r$ is the alphabet
size. A typical alphabet is the set of nucleotides {A, C, G, T}, but here we
allow more general state spaces. Without loss of generality, we denote the
alphabet by $\mathcal{R} = [r] = \{1, \ldots, r\}$. The $(i, j)$'th entry of $Q$ encodes the rate
at which state $i$ mutates into state $j$.

• Binary Tree. An evolutionary tree T, where the leaves are the modern 
species and each branching represents a past speciation event. The leaves 
are labelled with names of species. Without loss of generality, we assume the
labels are $X = [n]$.

• Branch lengths. For each edge $e$, we have a scalar branch length $w_e$ which
measures the expected total number of substitutions per site along edge $e$.
Roughly speaking, $w_e$ is the amount of mutational change between the end
points of $e$.

The classical problem in phylogenetics can be stated as follows: 

• Phylogenetic Tree Reconstruction (PTR): Unmixed case. Given $n$ molecular
sequences of length $k$

$$\{ s_a = (s_a^i)_{i=1}^k \}_{a \in [n]}$$

with $s_a^i \in [r]$, which have evolved according to the process above with
independent sites, reconstruct the topology of the evolutionary tree.

There exists a vast theoretical literature on this problem. See e.g. [SS03] and 
references therein. 

However, various phenomena, notably variations in mutation rates across genomes
and incongruences between gene lineage histories, often make it necessary to
model molecular data as originating from a mixture of different phylogenies.
Using concentration of measure techniques, we show that mixtures of large trees
are typically identifiable. We also derive sequence-length requirements for
high-probability reconstruction.
1.1 Related work



Most prior theoretical work on mixture models has focused on the question of
identifiability. A class of phylogenetic models is identifiable if any two models
in the class produce different data distributions. It is well-known that unmixed
phylogenetic models are typically identifiable [Cha96]. This is not the case in
general for mixtures of phylogenies. For instance, Steel et al. [SSH94] showed
that for any two trees one can find a random scaling on each of them such that
their data distributions are identical. Hence it is hopeless in general to reconstruct
phylogenies under mixture models. See also [EW04, MS07, MMS08, SV07b,
SV07a, Ste09] for further examples of this type.

However, the negative examples constructed in the references above are not
necessarily typical. They use special features of the mutation models (and their
invariants) and allow themselves quite a bit of flexibility in setting up the
topologies and branch lengths. In fact, recently a variety of more standard mixture
models have been shown to be identifiable. These include the common GTR+Γ
model [AAR08, WS10] and GTR+Γ+I model [CH11], as well as some covarion
models [AR06], some group-based models [APRS11], and so-called r-component
identical tree mixtures [RS10]. Although these results do not provide practical
algorithms for reconstructing the corresponding mixtures, they do give hope that
these problems may be tackled successfully.

Beyond the identifiability question, there seems to have been little rigorous
work on reconstructing phylogenetic mixture models. One positive result is the
case of the molecular clock assumption with across-sites rate variation [SSH94],
although no sequence-length requirements are provided. There is a large body of
work on practical reconstruction algorithms for various types of mixtures, notably
rates-across-sites models and covarion-type models, using mostly likelihood and
Bayesian methods. See e.g. [Fel04] for references. But the optimization problems
they attempt to solve are likely NP-hard [CT06, Roc06]. There also exist
many techniques for testing for the presence of a mixture (for example, for
testing for rate heterogeneity), but such tests typically require the knowledge of the
phylogeny. See e.g. [HR97].

Here we give both identifiability and reconstruction results. The proof of our
main results relies on the construction of a clustering statistic that discriminates
between distinct phylogenies. A similar approach was used recently in [MR11].
There, however, the problem was to distinguish between phylogenies with the
same topology but different branch lengths. In the current work, a main technical
challenge is to analyze the simultaneous behavior of such a clustering statistic on
distinct topologies. A similar statistic was also used in [SS06] to prove a special
case of Theorem 2 below. However, in contrast to [SS06], our main result
requires that a clustering statistic be constructed based only on data generated
by the mixture, that is, without prior knowledge of the topologies to be
distinguished. Finally, unlike [MR11] and [SS06], we consider the more general GTR
model.

2 Definitions and Results 
2.1 Basic Definitions 

Phylogenies A phylogeny is a graphical representation of the speciation history
of a group of organisms. The leaves typically correspond to current species. Each
branching indicates a speciation event. Moreover we associate to each edge a
positive weight. This weight can be thought of roughly as the amount of
evolutionary change on the edge. More formally, we make the following definitions. See
e.g. [SS03]. Fix a set of leaf labels $X = [n] = \{1, \ldots, n\}$.

Definition 2.1 (Phylogeny). A weighted binary phylogenetic $X$-tree (or phylogeny)
$T = (V, E; \phi; w)$ is a tree with vertex set $V$, edge set $E$, leaf set $L$ with
$|L| = n$, and a bijective mapping $\phi : X \to L$ such that:

1. The degree of all internal vertices $V - L$ is exactly 3.

2. The edges are assigned weights $w : E \to (0, +\infty)$.

We let $\mathcal{T}_l[T] = (V, E; \phi)$ be the leaf-labelled topology of $T$.

Definition 2.2 (Tree metric). A phylogeny $T = (V, E; \phi; w)$ is naturally equipped
with a tree metric $d_T : X \times X \to (0, +\infty)$ defined as follows

$$d_T(a, b) = \sum_{e \in \mathrm{Path}_T(\phi(a), \phi(b))} w_e,$$

where $\mathrm{Path}_T(u, v)$ is the set of edges on the path between $u$ and $v$ in $T$. We will
refer to $d_T(a, b)$ as the evolutionary distance between $a$ and $b$. In a slight abuse of
notation, we also sometimes use $d_T(u, v)$ to denote the evolutionary distance as
above between any two vertices $u, v$ of $T$.

We will restrict ourselves to the following standard special case.



Definition 2.3 (Regular Phylogenies). Let $0 < f < g < +\infty$. We denote by $\mathbb{Y}^{(n)}_{f,g}$
the set of phylogenies $T = (V, E; \phi; w)$ with $n$ leaves such that $f \leq w_e \leq g$,
$\forall e \in E$. We also let $\mathbb{Y}_{f,g} = \bigcup_{n \geq 1} \mathbb{Y}^{(n)}_{f,g}$.

GTR model A commonly used model of DNA sequence evolution is the following
GTR model. See e.g. [SS03]. We first define an appropriate class of rate
matrices.

Definition 2.4 (GTR Rate Matrix). Let $\mathcal{R}$ be a set of character states with $r = |\mathcal{R}|$.
Without loss of generality we assume that $\mathcal{R} = [r]$. Let $\pi$ be a probability
distribution on $\mathcal{R}$ satisfying $\pi_x > 0$ for all $x \in \mathcal{R}$. A General Time-Reversible
(GTR) rate matrix on $\mathcal{R}$ with respect to stationary distribution $\pi$ is an $r \times r$
real-valued matrix $Q$ such that:

1. $Q_{xy} > 0$ for all $x \neq y \in \mathcal{R}$.

2. $\sum_{y \in \mathcal{R}} Q_{xy} = 0$, for all $x \in \mathcal{R}$.

3. $\pi_x Q_{xy} = \pi_y Q_{yx}$, for all $x, y \in \mathcal{R}$.

By the reversibility assumption, $Q$ has $r$ real eigenvalues

$$0 = \Lambda_1 > \Lambda_2 \geq \cdots \geq \Lambda_r.$$

We normalize $Q$ by fixing $\Lambda_2 = -1$.
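As an illustration of Definition 2.4, the following sketch builds a rate matrix satisfying the three conditions and rescales it so that the second eigenvalue equals -1. The parameterization $Q_{xy} = S_{xy}\pi_y$ with a symmetric positive matrix `S`, and the function names `spectrum` and `gtr_rate_matrix`, are assumptions made for this example and do not come from the text. States are indexed from 0 rather than 1.

```python
import numpy as np

def spectrum(Q, pi):
    """Eigenvalues of a reversible rate matrix, sorted in decreasing order.

    Reversibility makes D^{1/2} Q D^{-1/2} symmetric (D = diag(pi)),
    so a symmetric eigenvalue routine can be used.
    """
    d = np.sqrt(np.asarray(pi, dtype=float))
    sym = (d[:, None] * np.asarray(Q, dtype=float)) / d[None, :]
    return np.sort(np.linalg.eigvalsh(sym))[::-1]

def gtr_rate_matrix(pi, S):
    """Build Q with Q[x, y] = S[x, y] * pi[y] for x != y (assumed parameterization).

    S is symmetric with positive off-diagonal entries, which gives
    detailed balance: pi[x] * Q[x, y] == pi[y] * Q[y, x].
    """
    pi = np.asarray(pi, dtype=float)
    Q = np.asarray(S, dtype=float) * pi[None, :]
    np.fill_diagonal(Q, 0.0)
    np.fill_diagonal(Q, -Q.sum(axis=1))   # each row sums to zero
    second = spectrum(Q, pi)[1]           # second-largest eigenvalue, negative
    return Q / (-second)                  # rescale so that it equals -1
```

Rescaling by a positive constant preserves all three conditions, so the normalization only fixes the time unit.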

Definition 2.5 (GTR Model). Consider the following stochastic process. We are
given a phylogeny $T = (V, E; \phi; w)$ and a finite set $\mathcal{R}$ with $r$ elements. Let $\pi$ be
a probability distribution on $\mathcal{R}$ and $Q$ be a GTR rate matrix with respect to $\pi$.
Associate to each edge $e \in E$ the stochastic matrix

$$M(e) = \exp(w_e Q).$$

The process runs as follows. Choose an arbitrary root $\rho \in V$. Denote by $E_\downarrow$ the
set $E$ directed away from the root. Pick a state for the root at random according to
$\pi$. Moving away from the root toward the leaves, apply the channel $M(e)$ to each
edge $e$ independently. Denote the state so obtained $s_V = (s_v)_{v \in V}$. In particular,
$s_L$ is the state at the leaves, which we also denote by $s_X$. More precisely, the joint
distribution of $s_V$ is given by

$$\mu_V(s_V) = \pi(s_\rho) \prod_{e = (u,v) \in E_\downarrow} [M(e)]_{s_u s_v}.$$

For $W \subseteq V$, we denote by $\mu_W$ the marginal of $\mu_V$ at $W$. Under this model, the
weight $w_e$ is the expected number of substitutions on edge $e$ in the
continuous-time process. We denote by $\mathcal{D}[T, Q]$ the probability distribution of $s_V$. We also let
$\mathcal{D}_l[T, Q]$ denote the probability distribution of

$$s_X = (s_{\phi(a)})_{a \in X}.$$

More generally, we consider $k$ independent samples $\{s_V^i\}_{i=1}^k$ from the model
above, that is, $s_V^1, \ldots, s_V^k$ are i.i.d. $\mathcal{D}[T, Q]$. We think of $(s_v^i)_{i=1}^k$ as the sequence at
node $v \in V$. Typically, $\mathcal{R} = \{A, G, C, T\}$ and the model describes how DNA
sequences stochastically evolve by point mutations along an evolutionary tree under
the assumption that each site in the sequences evolves independently. When
considering many samples $\{s_V^i\}_{i=1}^k$, we drop the superscript to refer to a single sample
$s_V$.
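The process of Definition 2.5 can be simulated for one site without computing $\exp(w_e Q)$: running the continuous-time chain with rate matrix $Q$ for time $w_e$ along an edge has exactly that transition matrix. The sketch below does this with exponential holding times. The tree encoding (a `children` dictionary and a `weights` dictionary keyed by directed edges) and the function names are assumptions of this example; states are indexed from 0.

```python
import random

def evolve_along_edge(state, length, Q, rng):
    """Run the continuous-time chain with rate matrix Q for time `length`."""
    t = 0.0
    r = len(Q)
    while True:
        rate = -Q[state][state]              # total rate of leaving `state`
        if rate <= 0:
            return state
        t += rng.expovariate(rate)           # exponential holding time
        if t >= length:
            return state
        others = [y for y in range(r) if y != state]
        jump_rates = [Q[state][y] for y in others]
        state = rng.choices(others, weights=jump_rates)[0]

def sample_site(children, weights, root, pi, Q, rng):
    """One site: root state drawn from pi, then mutation along every edge."""
    states = {root: rng.choices(range(len(pi)), weights=pi)[0]}
    stack = [root]
    while stack:
        u = stack.pop()
        for v in children.get(u, []):
            states[v] = evolve_along_edge(states[u], weights[(u, v)], Q, rng)
            stack.append(v)
    return states
```

Calling `sample_site` $k$ times with the same arguments gives $k$ independent sites; restricting each returned dictionary to the leaves gives a sample of $s_X$.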

Mixed model We introduce the basic mixed model which will be the focus of 
this paper. We will use the following definition. We assume that Q is fixed and 
known throughout. 

Remark 2.1 (Unknown rate matrix). See the concluding remarks for an extension 
of our techniques when Q is unknown. 

Definition 2.6 ($\Theta$-Generic Mixture). Let $\Theta$ be a positive integer. In the $\Theta$-Generic
Mixture (GM) model, we consider a finite set of phylogenies

$$\mathbb{T} = \{T_\theta = (V_\theta, E_\theta; \phi_\theta; w_\theta)\}_{\theta=1}^{\Theta},$$

on the same set of leaf labels $X = [n]$ and a positive probability distribution
$\nu = (\nu_\theta)_{\theta=1}^{\Theta}$ on $[\Theta]$. Consider $k$ i.i.d. random variables $N^1, \ldots, N^k$ with distribution
$\nu$. Then, conditioned on $N^1, \ldots, N^k$, the samples $\{s_X^i\}_{i=1}^k$ generated under the
$\Theta$-GM model $(\mathbb{T}, \nu)$ are independent with conditional distribution $s_X^j \sim \mathcal{D}_l[T_{N^j}, Q]$,
$j = 1, \ldots, k$. We denote by $\mathcal{D}_l[(\mathbb{T}, \nu), Q]$ the probability distribution of $s_X^1$. We
will refer to $T_\theta$ as the $\theta$-component of the mixture $(\mathbb{T}, \nu)$.

We assume that $\Theta$ is fixed and known throughout. As above, we drop the
superscript to refer to a single sample $s_X$ with corresponding component indicator
$N$. To simplify notation, we let

$$d_{T_\theta} = d_\theta, \quad \forall \theta \in [\Theta].$$
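The two-stage sampling in Definition 2.6 can be sketched as follows: first draw the hidden component indicators $N^1, \ldots, N^k$ from $\nu$, then draw each site from the chosen component. The `component_samplers` argument (one callable per component, each taking a random generator and returning one site) is a hypothetical interface introduced only for this example; components are indexed from 0.

```python
import random

def sample_mixture(component_samplers, nu, k, rng):
    """Draw k sites from a mixture.

    Returns (labels, samples), where labels[j] is the hidden component
    indicator of site j and samples[j] is the site itself.
    """
    labels = rng.choices(range(len(nu)), weights=nu, k=k)
    samples = [component_samplers[theta](rng) for theta in labels]
    return labels, samples
```

In the reconstruction problem only `samples` is observed; `labels` plays the role of the hidden variables that the clustering statistic is meant to recover.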
Some notation We will use the notation $[n]^2 = \{(a,b) \in [n] \times [n] : a \leq b\}$,
$[n]^2_{=} = \{(a,a)\}_{a \in [n]}$, and $[n]^2_{\neq} = [n]^2 - [n]^2_{=}$. We also denote by $[n]^4_{\neq}$ the set of
pairs $(a_1,b_1), (a_2,b_2) \in [n]^2_{\neq}$ such that $(a_1,b_1) \neq (a_2,b_2)$ (as pairs). We use the
notation poly(n) to denote $O(n^{C})$ for some $C > 0$.

2.2 Main results 

We make the following assumptions on the mutation model. 

Assumption 1. Let $0 < f < g < +\infty$, and $\underline{\nu} > 0$. The following set of
assumptions on a $\Theta$-GM model $(\mathbb{T}, \nu)$ will be denoted by $A[f, g, \underline{\nu}]$:

1. Regular phylogenies: $T_\theta \in \mathbb{Y}_{f,g}$, $\forall \theta \in [\Theta]$.

2. Minimum frequency: $\nu_\theta \geq \underline{\nu}$, $\forall \theta \in [\Theta]$.

We denote by $\mathrm{GM}_\Theta[f, g, \underline{\nu}, n]$ the set of $\Theta$-GM models on $n$ leaves satisfying
$A[f, g, \underline{\nu}]$.

Remark 2.2 (No minimum frequency). See the concluding remarks for an extension
of our techniques when the minimum frequency assumption is not satisfied.

Tree identifiability Our first result states that, under $A[f, g, \underline{\nu}]$, $\Theta$-GM models
are identifiable, except for an "asymptotically negligible fraction." To formalize
this notion, we use the following definition. Note that $\mathrm{GM}_\Theta[f, g, \underline{\nu}, n]$ is a
compact subset of a finite product of metric spaces [BHV01] which we equip with its
Borel $\sigma$-algebra.

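Definition 2.7 (Permutation-invariant measure). Let

$$A \subseteq \mathrm{GM}_\Theta[f, g, \underline{\nu}, n],$$

be a Borel set. Given permutations $\Pi = \{\Pi_\theta\}_{\theta \in [\Theta]}$ of $X$, we let

$$\Pi[\mathbb{T}] = \{\Pi_\theta[T_\theta]\}_{\theta \in [\Theta]} = \{(V_\theta, E_\theta; \phi_\theta \circ \Pi_\theta; w_\theta)\}_{\theta \in [\Theta]},$$

where $\circ$ indicates composition, and

$$A_\Pi = \{(\mathbb{T}, \nu) \in \mathrm{GM}_\Theta[f, g, \underline{\nu}, n] : (\Pi[\mathbb{T}], \nu) \in A\}.$$

A probability measure $\lambda$ on $\mathrm{GM}_\Theta[f, g, \underline{\nu}, n]$ is permutation-invariant if for all $A$
and $\Pi$ as above, we have the following:

$$\lambda[A] = \lambda[A_\Pi].$$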
Remark 2.3. Alternatively one can think of a permutation-invariant measure as
first picking unlabeled trees, branch weights and mixture frequencies according
to a specified joint distribution; and then labeling the leaves of each tree in the
mixture independently, uniformly at random. Note that the independent labeling of
the trees is needed for our proof. It ensures that the phylogenies in the mixture are
typically "sufficiently distinct." Generalizing our results, possibly in a weaker
form, to mixtures of "similar" phylogenies is an important open problem. (For 
instance, one may want to consider the case where the same uniform labeling is 
applied to all trees in the mixture.) 

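For two $\Theta$-GM models $(\mathbb{T}, \nu)$ and $(\mathbb{T}' = \{T'_\theta\}_{\theta \in [\Theta]}, \nu')$, we write

$$(\mathbb{T}, \nu) \nsim (\mathbb{T}', \nu'),$$

if there is no bijective mapping $h$ of $[\Theta]$ such that

$$\mathcal{T}_l[T_\theta] = \mathcal{T}_l[T'_{h(\theta)}], \quad \forall \theta \in [\Theta].$$

In words, $(\mathbb{T}, \nu)$ and $(\mathbb{T}', \nu')$ are not equivalent up to component re-labeling.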

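Theorem 1 (Tree Identifiability). Fix $0 < f < g < +\infty$, and $\underline{\nu} > 0$. Then, there
exists a sequence of Borel subsets

$$A_n \subseteq \mathrm{GM}_\Theta[f, g, \underline{\nu}, n], \quad n \geq 1,$$

such that the following hold:

1. For any sequence of permutation-invariant measures $\lambda_n$, $n \geq 1$, respectively
on $\mathrm{GM}_\Theta[f, g, \underline{\nu}, n]$, $n \geq 1$, we have

$$\lambda_n[A_n] = 1 - o_n(\underline{\nu}, f, g),$$

as $n \to \infty$. Here $o_n(\underline{\nu}, f, g)$ indicates convergence to 0 as $n \to \infty$ for fixed
$\underline{\nu}, f, g$.

2. For all

$$(\mathbb{T}, \nu) \nsim (\mathbb{T}', \nu') \in \bigcup_{n \geq 1} A_n,$$

we have

$$\mathcal{D}_l[(\mathbb{T}, \nu), Q] \neq \mathcal{D}_l[(\mathbb{T}', \nu'), Q].$$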

Remark 2.4. As remarked above, our proof requires that the phylogenies in the
mixture are "sufficiently different." This is typically the case under a
permutation-invariant measure. Roughly speaking, the complements of the sets $A_n$ in the
previous theorem contain those exceptional instances where the phylogenies are too
"similar." See the proof for a formal definition of $A_n$.
Tree distance We also generalize to GTR models a result of Steel and Székely:
phylogenies are typically far away in variational distance [SS06]. The techniques
in [SS06] apply only to group-based models and other highly symmetric models.
(See [SS06] for details.) Let $\|\cdot\|_{TV}$ denote total variation distance, that is, for two
probability measures $\mathcal{D}, \mathcal{D}'$ on a measure space $(\Omega, \mathcal{F})$ define

$$\|\mathcal{D} - \mathcal{D}'\|_{TV} = \sup_{B \in \mathcal{F}} |\mathcal{D}(B) - \mathcal{D}'(B)|.$$

Theorem 2 (Tree distance). Let $\{A_n\}_n$ be as in Theorem 1 where $\Theta = 2$ and
$\underline{\nu} = 1/2$ (in which case we necessarily have $\nu = (1/2, 1/2)$). Then for all

$$(\mathbb{T}, \nu) \in \bigcup_{n \geq 1} A_n,$$

we have

$$\|\mathcal{D}_l[T_1, Q] - \mathcal{D}_l[T_2, Q]\|_{TV} = 1 - o_n(1).$$

Remark 2.5. Note that $\nu$ plays no substantive role in the previous theorem.
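For distributions on a finite set, the supremum in the definition of total variation distance equals half the $\ell_1$ distance between the probability vectors. The sketch below uses that identity. Representing a distribution as a dictionary from outcomes to probabilities is a choice made for this example only.

```python
def total_variation(p, q):
    """Total variation distance between two finite distributions.

    p and q map outcomes to probabilities; outcomes missing from a
    dictionary are treated as having probability zero.
    """
    support = set(p) | set(q)
    return 0.5 * sum(abs(p.get(x, 0.0) - q.get(x, 0.0)) for x in support)
```

A value close to 1, as in Theorem 2, means the two distributions put almost all of their mass on disjoint sets of leaf patterns.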

Tree reconstruction The proofs of Theorems 1 and 2 rely on the following
reconstruction result of independent interest. We show that the topologies can be
reconstructed efficiently with high confidence using polynomial length sequences.
Recall that $k$ denotes the sequence length.

Theorem 3 (Tree Reconstruction). Let $\{A_n\}_n$ be as in Theorem 1. Then for all

$$(\mathbb{T}, \nu) \in \bigcup_{n \geq 1} A_n,$$

the topologies of $(\mathbb{T}, \nu)$ can be reconstructed in time polynomial in $n$ and $k$
using polynomially many samples (i.e., $k$ is polynomial in $n$) with probability
$1 - o_n(\underline{\nu}, f, g)$ under the samples and the randomness of the algorithm.

The rest of the paper is devoted to the proof of Theorem 3 which implies
Theorems 1 and 2.

2.3 Proof overview 

The proof of Theorem 3 relies on the construction of a clustering statistic that 
discriminates between distinct phylogenies. 
Clustering statistic Fix $0 < f < g < +\infty$ and $\underline{\nu} > 0$. Suppose for now that
$\Theta = 2$ and let $\lambda$ be a permutation-invariant probability measure on $\mathrm{GM}_\Theta[f, g, \underline{\nu}, n]$.
It will be useful to think of $\lambda$ as a two-step procedure: first pick unlabeled,
weighted topologies; and second, assign a uniformly random labeling to the leaves
of each tree. Pick a $\Theta$-GM model $(\mathbb{T}, \nu)$ according to $\lambda$. We will denote by $\mathbb{P}_\lambda$ and
$\mathbb{E}_\lambda$ probability and expectation under $\lambda$. Similarly, we denote by $\mathbb{P}_l$ and $\mathbb{E}_l$ (resp.
$\mathbb{P}_A$ and $\mathbb{E}_A$) probability and expectation under $(\mathbb{T}, \nu)$ (resp. under the randomness
of our algorithm).

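Let $z = (z_x)_{x=1}^r$ be a (real-valued) right eigenvector of $Q$ corresponding to
eigenvalue $\Lambda_2 = -1$ and normalize $z$ so that

$$\sum_{x=1}^r \pi_x z_x^2 = 1.$$

(Any negative eigenvalue could be used instead.) Consider the following
one-dimensional mapping of the samples [MP03]: for all $i = 1, \ldots, k$ and $a \in X$

$$\sigma_a^i = z_{s_a^i}. \qquad (1)$$

Recall that we drop the superscript when referring to a single sample. It holds that

$$\mathbb{E}_l[\sigma_a \mid N = \theta] = 0. \qquad (2)$$

Moreover, by a computation in [MP03], one has

$$q_\theta(a,b) = \mathbb{E}_l[\sigma_a \sigma_b \mid N = \theta] - \mathbb{E}_l[\sigma_a \mid N = \theta]\,\mathbb{E}_l[\sigma_b \mid N = \theta]
= \mathbb{E}_l[\sigma_a \sigma_b \mid N = \theta]
= e^{-d_\theta(a,b)}, \qquad (3)$$

and

$$q(a,b) = \mathbb{E}_l[\sigma_a \sigma_b] - \mathbb{E}_l[\sigma_a]\,\mathbb{E}_l[\sigma_b] = \mathbb{E}_l[\sigma_a \sigma_b] = \sum_{\theta=1}^{\Theta} \nu_\theta e^{-d_\theta(a,b)}. \qquad (4)$$

We use a statistic of the form

$$\mathcal{U} = \frac{1}{|\Upsilon|} \sum_{(a,b) \in \Upsilon} \sigma_a \sigma_b, \qquad (5)$$

where $\Upsilon \subseteq [n]^2_{\neq}$. For $\mathcal{U}$ to be effective in discriminating between $T_1$ and $T_2$, we
require the following (informal) conditions: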
C1 The difference in conditional expectations

$$\Delta = \left| \mathbb{E}_l[\mathcal{U} \mid N = 1] - \mathbb{E}_l[\mathcal{U} \mid N = 2] \right|$$

is large.

C2 The statistic $\mathcal{U}$ is concentrated around its mean under both $\mathcal{D}_l[T_1, Q]$ and
$\mathcal{D}_l[T_2, Q]$.

C3 The set $\Upsilon$ can be constructed from data generated by the mixture $(\mathbb{T}, \nu)$.

A $\mathcal{U}$ satisfying C1-C3 could be used to infer the hidden variables $N^1, \ldots, N^k$
and, thereby, to cluster the samples in their respective component.
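Identities (2) and (3) can be checked numerically for a single pair of leaves. By reversibility, the states at two leaves at evolutionary distance $d$ have joint law $\pi_x [e^{dQ}]_{xy}$, so $\mathbb{E}[\sigma_a\sigma_b] = \sum_{x,y} \pi_x z_x [e^{dQ}]_{xy} z_y$. The sketch below computes this quantity through the symmetrized matrix $D^{1/2} Q D^{-1/2}$ with $D = \mathrm{diag}(\pi)$. The function names are hypothetical, and the input `Q` is assumed to be a reversible rate matrix with respect to `pi`.

```python
import numpy as np

def second_eigenvector(Q, pi):
    """Second-largest eigenvalue of Q and a right eigenvector z for it,
    normalized so that sum_x pi[x] * z[x]**2 == 1."""
    pi = np.asarray(pi, dtype=float)
    d = np.sqrt(pi)
    sym = (d[:, None] * np.asarray(Q, dtype=float)) / d[None, :]
    vals, vecs = np.linalg.eigh(sym)          # eigenvalues in ascending order
    lam, u = vals[-2], vecs[:, -2]
    z = u / d                                 # right eigenvector of Q itself
    return lam, z / np.sqrt(np.sum(pi * z * z))

def pair_correlation(Q, pi, z, dist):
    """E[z[s_a] * z[s_b]] for two leaves at evolutionary distance `dist`."""
    pi = np.asarray(pi, dtype=float)
    d = np.sqrt(pi)
    sym = (d[:, None] * np.asarray(Q, dtype=float)) / d[None, :]
    vals, vecs = np.linalg.eigh(sym)
    P = (vecs * np.exp(dist * vals)) @ vecs.T   # exp(dist * sym)
    P = P / d[:, None] * d[None, :]             # conjugate back to exp(dist * Q)
    return float(np.sum(pi[:, None] * np.outer(z, z) * P))
```

When `Q` is normalized so that the second eigenvalue is -1, `pair_correlation(Q, pi, z, dist)` should agree with `exp(-dist)` up to rounding, and `sum(pi * z)` should vanish, matching (3) and (2).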

Prior work In [MR11], it was shown in a related context that taking $\Upsilon = [n]^2_{\neq}$
is not in general an appropriate choice as it may lead to a large variance. Instead,
the following lemma was used.

Claim (Disjoint close pairs [SS06]. See also [MR11].). For any $T \in \mathbb{Y}^{(n)}_{f,g}$, there
exists a subset $\Upsilon \subseteq [n]^2_{\neq}$ such that the following hold:

1. $|\Upsilon| = \Omega(n)$.

2. $\forall (a,b) \in \Upsilon$, $d_T(a,b) \leq 3g$.

3. $\forall (a_1,b_1) \neq (a_2,b_2) \in \Upsilon$, the paths $\mathrm{Path}_T(a_1, b_1)$ and $\mathrm{Path}_T(a_2, b_2)$ are
edge-disjoint. We will say that such pairs are $T$-disjoint.

For special $Q$ matrices, it was shown in [SS06] and [MR11] that such a $\Upsilon$
for $T = T_1$, say, can be used to construct a clustering statistic (similar to (5))
concentrated under $\mathcal{D}_l[T_1, Q]$. In particular, the $T_1$-disjointness assumption above
implies the independence of the variables $\sigma_{a_1}\sigma_{b_1}$ and $\sigma_{a_2}\sigma_{b_2}$ under the $Q$ matrices
considered in [SS06, MR11]. Moreover, Steel and Székely [SS06] proved the
existence of a further subset that is also $T_2$-disjoint, but their construction requires
the knowledge of $T_2$. Here we show how to satisfy conditions C1-C3 under GTR
models.
High-level construction We give a sketch of our techniques. Formal statements
and full proofs can be found in Sections 3, 4 and 5. For $\alpha > 0$, let

$$\Upsilon_{\alpha,\theta} = \{(a, b) \in [n]^2_{\neq} : d_\theta(a,b) \leq \alpha\},$$

and

$$\Upsilon_\alpha = \bigcup_{\theta \in [\Theta]} \Upsilon_{\alpha,\theta}.$$

Because the variables $N^1, \ldots, N^k$ are hidden we cannot infer $\Upsilon_{\alpha,\theta}$ directly from
the samples, for instance using (3). Instead:

[Step 1] Using (4) and the estimator

$$\hat{q}(a,b) = \frac{1}{k} \sum_{i=1}^{k} \sigma_a^i \sigma_b^i,$$

we construct a linear-sized set $\Upsilon'$ satisfying

$$\Upsilon_{4g} \subseteq \Upsilon' \subseteq \Upsilon_{C_c},$$

for an appropriate constant $C_c$. (See Lemma 4.1.)

Define

$$\Upsilon'_\theta = \Upsilon' \cap \Upsilon_{C_c,\theta}.$$

For general GTR rate matrices, $T_\theta$-disjointness of $(a_1,b_1), (a_2,b_2) \in \Upsilon'_\theta$ does
not guarantee independence of $\sigma_{a_1}\sigma_{b_1}$ and $\sigma_{a_2}\sigma_{b_2}$ under $\mathcal{D}_l[T_\theta, Q]$. Instead, we
choose pairs that are far enough from each other by picking a sufficiently sparse
random subset of $\Upsilon'$. (See Lemma 3.6.) We say that $(a_1,b_1), (a_2,b_2) \in \Upsilon'_\theta$ are
$T_\theta$-far if the smallest evolutionary distance between $\{a_1, b_1\}$ and $\{a_2, b_2\}$ is at least
$C_f \log\log n$ for a constant $C_f > 0$ to be determined.

[Step 2] We take a random subset $\Upsilon''$ of $\Upsilon'$ with

$$|\Upsilon''| = \Theta(\log n).$$

(See Lemma 4.2.)
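Steps 1 and 2 can be sketched in code as follows. The estimator $\hat{q}$ is the one displayed above. The rule used here to build the candidate set (keep a pair when its empirical correlation is at least a fixed `threshold`) is only an assumed stand-in, since the actual construction is the subject of Lemma 4.1; likewise `keep_prob` is a free parameter of this example. The input `sigma` is a list of $k$ rows, each holding the values $\sigma_a^i$ for leaves $a = 0, \ldots, n-1$.

```python
import itertools
import random

def q_hat(sigma, a, b):
    """Empirical correlation (1/k) * sum_i sigma[i][a] * sigma[i][b]."""
    return sum(row[a] * row[b] for row in sigma) / len(sigma)

def close_pairs(sigma, n, threshold):
    """Pairs a < b whose empirical correlation is at least `threshold`.

    Thresholding is an assumed selection rule for this sketch.
    """
    return [(a, b) for a, b in itertools.combinations(range(n), 2)
            if q_hat(sigma, a, b) >= threshold]

def sparsify(pairs, keep_prob, rng):
    """Keep each pair independently with probability `keep_prob`."""
    return [c for c in pairs if rng.random() < keep_prob]
```

The motivation for a threshold rule is (4): a pair that is close in some component has $q(a,b)$ bounded below by a constant, whereas a pair that is far apart in every component has $q(a,b)$ close to zero.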



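Denoting

$$\Upsilon''_\theta = \Upsilon'' \cap \Upsilon_{C_c,\theta},$$

we show that all $(a_1,b_1) \neq (a_2,b_2) \in \Upsilon''_\theta$ are $T_\theta$-far. Under a
permutation-invariant $\lambda$, a pair $(a,b) \in \Upsilon_{\alpha,1}$ is unlikely to be in $\Upsilon_{\alpha,2}$. In particular we show
that, under $\lambda$, the intersection of $\Upsilon''_1$ and $\Upsilon''_2$ is empty with high probability. In fact, a pair
$(a,b) \in \Upsilon_{\alpha,1}$ is likely to be such that $d_2(a,b)$ is large. We say that $(a,b) \in [n]^2_{\neq}$ is
$T_\theta$-stretched if $d_\theta(a,b) \geq C_{st} \log\log n$ for a constant $C_{st} > 0$ to be determined. We show that
all $(a,b) \in \Upsilon''_1$ are $T_2$-stretched. (See Lemma 3.5.)

To infer $\Upsilon''_\theta$, we consider the quantity

$$\hat{r}(c_1, c_2) = \frac{1}{k} \sum_{i=1}^{k} \left[ \sigma_{a_1}^i \sigma_{b_1}^i \sigma_{a_2}^i \sigma_{b_2}^i - \hat{q}(a_1,b_1)\,\hat{q}(a_2,b_2) \right],$$

for $c_1 = (a_1,b_1) \neq c_2 = (a_2,b_2) \in [n]^2_{\neq}$. We note that if $(a,b) \in \Upsilon''_1$ is
$T_2$-stretched then

$$\mathbb{E}_l[\sigma_a \sigma_b \mid N = 2] \approx \mathbb{E}_l[\sigma_a \mid N = 2]\,\mathbb{E}_l[\sigma_b \mid N = 2] = 0,$$

and

$$q(a,b) \approx \nu_1 q_1(a,b).$$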

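There are then two cases:

I. If $c_1 = (a_1,b_1) \neq c_2 = (a_2,b_2) \in \Upsilon''_1$ (and similarly for $\Upsilon''_2$), they are
$T_1$-far and each is $T_2$-stretched. Moreover we show that $(c_1, c_2)$ is $T_2$-far.
Therefore,

$$q(a_1,b_1) \approx \nu_1 q_1(a_1,b_1), \qquad q(a_2,b_2) \approx \nu_1 q_1(a_2,b_2),$$

and we show further that

$$\mathbb{E}_l[\sigma_{a_1}\sigma_{b_1}\sigma_{a_2}\sigma_{b_2}] \approx \nu_1 \mathbb{E}_l[\sigma_{a_1}\sigma_{b_1} \mid N = 1]\,\mathbb{E}_l[\sigma_{a_2}\sigma_{b_2} \mid N = 1]$$
$$\qquad + \nu_2 \mathbb{E}_l[\sigma_{a_1} \mid N = 2]\,\mathbb{E}_l[\sigma_{b_1} \mid N = 2]\,\mathbb{E}_l[\sigma_{a_2} \mid N = 2]\,\mathbb{E}_l[\sigma_{b_2} \mid N = 2]$$
$$\approx \nu_1 q_1(a_1,b_1)\,q_1(a_2,b_2).$$

So

$$r(c_1, c_2) \approx \nu_1 (1 - \nu_1)\, q_1(a_1,b_1)\, q_1(a_2,b_2) > 0.$$

II. On the other hand, if $c_1 = (a_1,b_1) \in \Upsilon''_1$ and $c_2 = (a_2,b_2) \in \Upsilon''_2$, then $c_1$ is
$T_2$-stretched and $c_2$ is $T_1$-stretched. Moreover we show that $(c_1, c_2)$ is both
$T_1$-far and $T_2$-far. Therefore,

$$q(a_1,b_1) \approx \nu_1 q_1(a_1,b_1), \qquad q(a_2,b_2) \approx \nu_2 q_2(a_2,b_2),$$

and we show that

$$\mathbb{E}_l[\sigma_{a_1}\sigma_{b_1}\sigma_{a_2}\sigma_{b_2}] \approx \nu_1 \mathbb{E}_l[\sigma_{a_1}\sigma_{b_1} \mid N = 1]\,\mathbb{E}_l[\sigma_{a_2} \mid N = 1]\,\mathbb{E}_l[\sigma_{b_2} \mid N = 1]$$
$$\qquad + \nu_2 \mathbb{E}_l[\sigma_{a_1} \mid N = 2]\,\mathbb{E}_l[\sigma_{b_1} \mid N = 2]\,\mathbb{E}_l[\sigma_{a_2}\sigma_{b_2} \mid N = 2]$$
$$\approx 0.$$

So

$$r(c_1, c_2) \approx -\nu_1 q_1(a_1,b_1)\,\nu_2 q_2(a_2,b_2) < 0.$$

(See Lemma 3.7.)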
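Cases I and II suggest a simple way to group pairs when $\Theta = 2$: two pairs belong to the same component when $\hat{r}$ is positive and to different components when it is negative. The sketch below computes $\hat{r}$ as defined above and splits a list of pairs by comparing each one to the first pair. Using the first pair as a reference, and the function names, are simplifications assumed for this example rather than the procedure of Lemma 4.3.

```python
def r_hat(sigma, c1, c2):
    """(1/k) * sum_i [s_a1 * s_b1 * s_a2 * s_b2] - q_hat(c1) * q_hat(c2)."""
    (a1, b1), (a2, b2) = c1, c2
    k = len(sigma)
    fourth = sum(s[a1] * s[b1] * s[a2] * s[b2] for s in sigma) / k
    q1 = sum(s[a1] * s[b1] for s in sigma) / k
    q2 = sum(s[a2] * s[b2] for s in sigma) / k
    return fourth - q1 * q2

def split_by_sign(sigma, pairs):
    """Split pairs into those on the same side as pairs[0] and the rest.

    Assumes two components and that every pair is close in exactly one.
    """
    ref = pairs[0]
    same = [ref] + [c for c in pairs[1:] if r_hat(sigma, ref, c) > 0]
    other = [c for c in pairs[1:] if r_hat(sigma, ref, c) <= 0]
    return same, other
```

Since the true value of $r$ is bounded away from zero in both cases, the sign of $\hat{r}$ is reliable once $k$ is large enough for the estimate to concentrate.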

The argument above leads to the following step.

[Step 3] For all pairs $c_1 = (a_1,b_1)$ and $c_2 = (a_2,b_2)$ in $\Upsilon''$, we
compute $\hat{r}(c_1, c_2)$. Using Cases I and II, we then infer the sets $\Upsilon''_1$
and $\Upsilon''_2$. We form the clustering statistics

$$\mathcal{U}_\theta^i = \frac{1}{|\Upsilon''_\theta|} \sum_{(a,b) \in \Upsilon''_\theta} \sigma_a^i \sigma_b^i,$$

for $\theta = 1, 2$ and $i = 1, \ldots, k$. (See Lemma 4.3.)

By (3) and the arguments in Cases I and II above, we get that for $(a,b) \in \Upsilon''_1$

$$\mathbb{E}_l[\sigma_a \sigma_b \mid N = 1] = q_1(a,b),$$

whereas

$$\mathbb{E}_l[\sigma_a \sigma_b \mid N = 2] \approx \mathbb{E}_l[\sigma_a \mid N = 2]\,\mathbb{E}_l[\sigma_b \mid N = 2] = 0,$$

so that (dropping the superscript to refer to a single sample)

$$\mathbb{E}_l[\mathcal{U}_1 \mid N = 1] > C_\Delta,$$

whereas

$$\mathbb{E}_l[\mathcal{U}_1 \mid N = 2] < C_\Delta,$$

for a constant $C_\Delta > 0$ to be determined later. (See Lemma 3.8.) Moreover, the
properties of $\Upsilon''_\theta$ discussed in Cases I and II allow us to prove further that $\mathcal{U}_\theta$ is
concentrated around its mean. (See Lemma 3.9.) This leads to the following step.
[Step 4] Divide the samples $i = 1, \ldots, k$ into two clusters $\mathcal{K}_1$ and
$\mathcal{K}_2$, according to whether

$$\mathcal{U}_1^i > C_\Delta \quad \text{or} \quad \mathcal{U}_2^i > C_\Delta,$$

respectively. (See Lemma 5.1.)

Once the samples are divided into pure components, we apply standard
reconstruction techniques to infer each topology.

[Step 5] For $\theta = 1, 2$, reconstruct the topology $\mathcal{T}_l[T_\theta]$ from the
samples in $\mathcal{K}_\theta$. (See Lemma 5.3.)
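Step 4 amounts to thresholding a per-sample average. The sketch below computes the clustering statistic for each sample and splits the sample indices accordingly. For simplicity it uses only the statistic built from the pairs of the first component and sends every remaining sample to the second cluster; this one-sided rule, the `threshold` parameter and the function names are assumptions of the example.

```python
def clustering_statistic(sample, pairs):
    """Average of sample[a] * sample[b] over the given pairs."""
    return sum(sample[a] * sample[b] for a, b in pairs) / len(pairs)

def cluster_samples(sigma, pairs_1, threshold):
    """Split sample indices into two clusters by thresholding the statistic.

    Indices whose statistic exceeds `threshold` go to the first cluster,
    all others to the second.
    """
    cluster_1, cluster_2 = [], []
    for i, sample in enumerate(sigma):
        if clustering_statistic(sample, pairs_1) > threshold:
            cluster_1.append(i)
        else:
            cluster_2.append(i)
    return cluster_1, cluster_2
```

Each cluster can then be passed to any single-tree reconstruction method, as in Step 5.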



General $\Theta$ When $\Theta > 2$, we proceed as above and construct a clustering
statistic for each component.



3 Main lemmas 

In this section, we derive a number of preliminary results. These results are also 
described informally in Section 2.3. 

Fix a GTR matrix $Q$ and constants $\Theta \geq 2$, $0 < f < g < +\infty$ and $\underline{\nu} > 0$.
Let $\lambda$ be a permutation-invariant probability measure on $\mathrm{GM}_\Theta[f, g, \underline{\nu}, n]$. Pick a
$\Theta$-GM model $(\mathbb{T}, \nu)$ according to $\lambda$ and generate $k$ independent samples $\{s_X^i\}_{i=1}^k$
from $\mathcal{D}_l[(\mathbb{T}, \nu), Q]$. We work with the mapping $\{\sigma_X^i\}_{i=1}^k$ defined in (1): for all
$i = 1, \ldots, k$ and $a \in X$

$$\sigma_a^i = z_{s_a^i},$$

where $z = (z_x)_{x=1}^r$ is a right eigenvector of $Q$ corresponding to eigenvalue
$\Lambda_2 = -1$, normalized so that

$$\sum_{x=1}^r \pi_x z_x^2 = 1.$$

We will denote by $\mathbb{P}_\lambda$ and $\mathbb{E}_\lambda$ probability and expectation under $\lambda$. Similarly, we
denote by $\mathbb{P}_l$ and $\mathbb{E}_l$ (resp. $\mathbb{P}_A$ and $\mathbb{E}_A$) probability and expectation under $(\mathbb{T}, \nu)$
(resp. under the randomness of our algorithm).

Throughout we assume that the number of samples is $k = n^{C_k}$ for some $C_k > 0$
to be fixed later.
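3.1 Large-sample asymptotics

Denoting $\mathcal{K} = [k]$, let $\mathcal{K}_\theta \subseteq \mathcal{K}$ be those samples coming from component $\theta$, that
is,

$$\mathcal{K}_\theta = \{i \in \mathcal{K} : N^i = \theta\}.$$

Lemma 3.1 (Size of $\mathcal{K}_\theta$). Under $\mathbb{P}_l$, for any $C_\mathcal{K} > 1$ we have

$$C_\mathcal{K}^{-1} \leq \frac{|\mathcal{K}_\theta|}{\nu_\theta k} \leq C_\mathcal{K},$$

for all $\theta \in [\Theta]$, except with probability $\exp(-\Omega(n^{C_k}))$.

Proof. Recall that $\underline{\nu} \leq \nu_\theta \leq 1 - \underline{\nu}$. Using Lemma A.1 with $m = k$ and

$$\zeta = \nu_\theta k \max\{1 - C_\mathcal{K}^{-1}, C_\mathcal{K} - 1\} = \nu_\theta k (C_\mathcal{K} - 1),$$

gives the result. □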
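Consider the estimators

$$\hat{q}_\theta(a,b) = \frac{1}{|\mathcal{K}_\theta|} \sum_{i \in \mathcal{K}_\theta} \sigma_a^i \sigma_b^i,$$

and

$$\hat{q}(a,b) = \frac{1}{k} \sum_{i=1}^{k} \sigma_a^i \sigma_b^i.$$

Let

$$q_\theta(a,b) = e^{-d_\theta(a,b)},$$

and

$$q(a,b) = \sum_{\theta=1}^{\Theta} \nu_\theta q_\theta(a,b).$$

Lemma 3.2 (Accuracy of $\hat{q}$). Fix $0 < C_q < C_k/2$. Under $\mathbb{P}_l$, we have

$$|\hat{q}(a,b) - q(a,b)| \leq n^{-C_q},$$

and

$$|\hat{q}_\theta(a,b) - q_\theta(a,b)| \leq n^{-C_q},$$

for all $\theta \in [\Theta]$ and all $(a,b) \in [n]^2_{\neq}$ except with probability $\exp(-\mathrm{poly}(n))$.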
Proof. For each $(a,b) \in [n]^2_{\neq}$, $\hat{q}(a,b)$ is a sum of independent variables. By
Lemma A.1, taking $m = k$, $t = k^{-1} \max_i |z_i|^2$, $\zeta = n^{-C_q}$ we have

$$|\hat{q}(a,b) - q(a,b)| \leq n^{-C_q},$$

except with probability $2\exp(-\Omega(n^{C_k - 2C_q}))$. Note that there are at most $n^2$
elements in $[n]^2_{\neq}$ so that the probability of failure is at most

$$2n^2 \exp(-\Omega(n^{C_k - 2C_q})) = \exp(-\Omega(n^{C_k - 2C_q})).$$

Using Lemma 3.1, the same holds for each $\theta$. The overall probability of failure
under $\mathbb{P}_l$ is $\exp(-\Omega(n^{C_k - 2C_q}))$. □

Following the same argument, a similar result holds for

$$\hat{r}(c_1, c_2) = \frac{1}{k} \sum_{i=1}^{k} \left[ \sigma_{a_1}^i \sigma_{b_1}^i \sigma_{a_2}^i \sigma_{b_2}^i - \hat{q}(a_1,b_1)\,\hat{q}(a_2,b_2) \right],$$

for $c_1 = (a_1,b_1) \neq c_2 = (a_2,b_2) \in [n]^2_{\neq}$. Let

$$r(c_1, c_2) = \mathbb{E}_l[\hat{r}(c_1, c_2)].$$

Lemma 3.3 (Accuracy of $\hat{r}$). Under $\mathbb{P}_l$, we have

$$|\hat{r}(c_1, c_2) - r(c_1, c_2)| \leq n^{-C_q},$$

for all $c_1 = (a_1,b_1) \neq c_2 = (a_2,b_2) \in [n]^2_{\neq}$ except with probability $\exp(-\mathrm{poly}(n))$.

3.2 Combinatorial properties 

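For $\alpha > 0$, let

$$\Upsilon_{\alpha,\theta} = \{(a,b) \in [n]^2_{\neq} : d_\theta(a,b) \leq \alpha\}, \qquad (6)$$

and

$$\Upsilon_\alpha = \bigcup_{\theta \in [\Theta]} \Upsilon_{\alpha,\theta}. \qquad (7)$$

The lower bound below follows from a (stronger) lemma in [SS06]. See also [MR11].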
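Lemma 3.4 (Size of $\Upsilon_{\alpha,\theta}$). For all $\alpha \geq 4g$ and $\theta \in [\Theta]$

$$\frac{1}{4} n \leq |\Upsilon_{\alpha,\theta}| \leq 2^{\lfloor \alpha/f \rfloor} n.$$

In particular,

$$\frac{1}{4} n \leq |\Upsilon_\alpha| \leq \Theta\, 2^{\lfloor \alpha/f \rfloor} n.$$

Proof. For $a \in X$ and $\alpha \geq 4g$, let

$$\mathcal{B}_\alpha(a) = \{v \in V_\theta : d_\theta(\phi_\theta(a), v) \leq \alpha\}.$$

Since $T_\theta$ is binary, there are at most $2^{\lfloor \alpha/f \rfloor}$ vertices within evolutionary distance $\alpha$,
that is,

$$|\mathcal{B}_\alpha(a)| \leq 2^{\lfloor \alpha/f \rfloor}.$$

Restricting to leaves gives the upper bound.

Let

$$\Gamma_\alpha = \{a \in [n] : d_\theta(a,b) > \alpha, \ \forall b \in [n] - \{a\}\},$$

that is, $\Gamma_\alpha$ is the set of leaves with no other leaf at evolutionary distance $\alpha$ in $T_\theta$.
We will bound the size of $\Gamma_\alpha$. Note that for all $a, b \in \Gamma_\alpha$ with $a \neq b$ we have
$\mathcal{B}_{\alpha/2}(a) \cap \mathcal{B}_{\alpha/2}(b) = \emptyset$ by the triangle inequality. Moreover, it holds that for all
$a \in \Gamma_\alpha$

$$|\mathcal{B}_{\alpha/2}(a)| \geq 2^{\lfloor \alpha/(2g) \rfloor},$$

since $T_\theta$ is binary and there is no leaf other than $a$ in $\mathcal{B}_{\alpha/2}(a)$. Hence, we must have

$$|\Gamma_\alpha| \leq \frac{2n - 2}{2^{\lfloor \alpha/(2g) \rfloor}},$$

as there are $2n - 2$ nodes in $T_\theta$. Now, for all $a \notin \Gamma_\alpha$ assign an arbitrary leaf at
evolutionary distance at most $\alpha$. Then

$$|\Upsilon_{\alpha,\theta}| \geq \frac{1}{2}\left(n - |\Gamma_\alpha|\right),$$

where we divided by 2 to avoid double-counting. The result follows from the
assumption $\alpha \geq 4g$. □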
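To make the sets in (6) concrete, the following sketch computes the tree metric of Definition 2.2 on a small weighted tree and lists the pairs of leaves within a given evolutionary distance. The edge-dictionary encoding of the tree and the function names are assumptions made for this example.

```python
import itertools

def tree_metric(edges, leaves):
    """Pairwise path lengths between leaves of a weighted tree.

    `edges` maps undirected edges (u, v) to positive weights.
    """
    adj = {}
    for (u, v), w in edges.items():
        adj.setdefault(u, []).append((v, w))
        adj.setdefault(v, []).append((u, w))
    dist = {}
    for a in leaves:
        seen, stack = {a: 0.0}, [a]
        while stack:                      # depth-first walk from leaf a
            u = stack.pop()
            for v, w in adj[u]:
                if v not in seen:
                    seen[v] = seen[u] + w
                    stack.append(v)
        for b in leaves:
            dist[(a, b)] = seen[b]
    return dist

def close_pair_set(dist, leaves, alpha):
    """Pairs of distinct leaves at evolutionary distance at most alpha."""
    return [(a, b) for a, b in itertools.combinations(leaves, 2)
            if dist[(a, b)] <= alpha]
```

On a four-leaf tree with unit edge weights (so $f = g = 1$), taking `alpha = 4` returns all six pairs, which lies between the bounds $n/4 = 1$ and $2^{4} n = 64$ of Lemma 3.4.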
Let $C_c > 4g$, $C_f > 0$, and $C_{st} > C_f$ to be fixed later.

Definition 3.1 ($T_\theta$-quasicherry). We say that $(a,b) \in [n]^2_{\neq}$ is a $T_\theta$-quasicherry if
$(a,b) \in \Upsilon_{C_c,\theta}$.

Definition 3.2 ($T_\theta$-stretched). We say that $(a,b) \in [n]^2_{\neq}$ is $T_\theta$-stretched if
$d_\theta(a,b) \geq C_{st} \log\log n$.

Definition 3.3 ($T_\theta$-far). We say that $c_1 = (a_1,b_1) \neq c_2 = (a_2,b_2) \in [n]^2_{\neq}$ are
$T_\theta$-far if

$$d_\theta(c_1, c_2) = \min\{d_\theta(x_1, x_2) : x_1 \in \{a_1, b_1\}, x_2 \in \{a_2, b_2\}\} \geq C_f \log\log n.$$

Let $\Upsilon'$ be any subset satisfying

$$\Upsilon_{4g} \subseteq \Upsilon' \subseteq \Upsilon_{C_c}, \qquad (8)$$

and let

$$\Upsilon'_\theta = \Upsilon' \cap \Upsilon_{C_c,\theta}. \qquad (9)$$

Let $C_{sp} > 0$ to be fixed later. Keep each $(a,b) \in \Upsilon_{C_c}$ independently with
probability

$$p_{sp} = \frac{C_{sp} \log n}{n}$$

to form the set $\Upsilon''_{C_c}$ and let

$$\Upsilon'' = \Upsilon' \cap \Upsilon''_{C_c}.$$

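Let $0 < C^-_{sp} < C^+_{sp} < +\infty$ be constants (to be determined).

Definition 3.4 (Properly sparse). A subset $\Upsilon_{4g} \subseteq \Upsilon'' \subseteq \Upsilon_{C_c}$ with

$$\Upsilon''_\theta = \Upsilon'' \cap \Upsilon_{C_c,\theta}, \quad \theta \in [\Theta],$$

is properly sparse if it satisfies the following properties: For all $\theta \in [\Theta]$,

1. We have $C^-_{sp} \log n \leq |\Upsilon''_\theta| \leq C^+_{sp} \log n$.

2. All $c_1 = (a_1,b_1) \neq c_2 = (a_2,b_2) \in \Upsilon''_\theta$ are $T_\theta$-far.

3. All pairs in $\Upsilon''_\theta$ are $T_{\theta'}$-stretched for $\theta' \neq \theta$.

Let

$$\Upsilon''_{C_c,\theta} = \Upsilon_{C_c,\theta} \cap \Upsilon''_{C_c},$$

and

$$\Upsilon''_{4g,\theta} = \Upsilon_{4g,\theta} \cap \Upsilon''_{C_c}, \quad \theta \in [\Theta].$$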

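Lemma 3.5 (Sparsification). There exist constants $0 < C^-_{sp} < C^+_{sp} < +\infty$ such
that, under $\mathbb{P}_{\lambda,A}$, the set $\Upsilon''_{C_c}$ as above satisfies the following properties, except
with probability $1/\mathrm{poly}(n)$: For all $\theta \in [\Theta]$,

1. We have $C^-_{sp} \log n \leq |\Upsilon''_{4g,\theta}|$ and $|\Upsilon''_{C_c,\theta}| \leq C^+_{sp} \log n$.

2. All $c_1 = (a_1,b_1) \neq c_2 = (a_2,b_2) \in \Upsilon''_{C_c,\theta}$ are $T_\theta$-far.

3. All pairs in $\Upsilon''_{C_c,\theta}$ are $T_{\theta'}$-stretched for $\theta' \neq \theta$.

In particular, the set $\Upsilon''$ as above is properly sparse. Moreover, the claim holds
for any $C^-_{sp} > 0$ by taking $C_{sp} > 0$ large enough.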

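Proof. Intuitively, Part 2 follows from the sparsification step whereas Part 3 is a
consequence of the permutation-invariance of $\lambda$. We give a formal proof next.
For Part 1, we use Lemma A.2 in the appendix. Take

$$\frac{1}{4} C_{sp} \log n \leq M_{4g} = \frac{C_{sp} \log n}{n} |\Upsilon_{4g,\theta}|,$$

and

$$M_{C_c} = \frac{C_{sp} \log n}{n} |\Upsilon_{C_c,\theta}| \leq C_{sp}\, 2^{\lfloor C_c/f \rfloor} \log n.$$

With $\delta_- = 1/2$, $\delta_+ = 5$, we have

$$\mathbb{P}_A\left[|\Upsilon''_{4g,\theta}| < (1 - \delta_-) M_{4g}\right] \leq e^{-M_{4g} \delta_-^2/2} = \frac{1}{\mathrm{poly}(n)},$$

and

$$\mathbb{P}_A\left[|\Upsilon''_{C_c,\theta}| > (1 + \delta_+) M_{C_c}\right] \leq 2^{-(1+\delta_+) M_{C_c}} = \frac{1}{\mathrm{poly}(n)}.$$

The first part follows from the choice

$$C^-_{sp} = \frac{C_{sp}}{8},$$

and

$$C^+_{sp} = 6\, C_{sp}\, 2^{\lfloor C_c/f \rfloor}.$$

For the second part, let $c_1 = (a_1,b_1)$ be a pair in $\Upsilon''_{C_c,\theta}$. Let $S$ be the collection
of pairs $c_2 = (a_2,b_2) \neq c_1$ in the original set $\Upsilon_{C_c}$ that are within evolutionary
distance $C_f \log\log n$ of $c_1$ in $T_\theta$, that is,

$$d_\theta(c_1, c_2) < C_f \log\log n.$$

Note that the number of leaves within evolutionary distance $C_f \log\log n$ from $a_1$
or $b_1$ is at most $2 \cdot 2^{\lfloor C_f \log\log n / f \rfloor}$. Moreover, each such leaf can be involved in at most
$\Theta\, 2^{\lfloor C_c/f \rfloor}$ pairs, since any pair in $\Upsilon_{C_c}$ must be a $T_{\theta'}$-quasicherry for some $\theta' \in [\Theta]$
and the number of leaves at evolutionary distance $C_c$ from a vertex in a tree in
$\mathbb{Y}_{f,g}$ is at most $2^{\lfloor C_c/f \rfloor}$. Hence

$$|S| \leq 2 \cdot 2^{\lfloor C_f \log\log n / f \rfloor} \cdot \Theta\, 2^{\lfloor C_c/f \rfloor} = O(\log n).$$

Therefore the probability that any $c_2 \in S$ remains in $\Upsilon''_{C_c}$ is at most $O(\log^2 n / n)$.
Assuming Part 1 holds, summing over $\Upsilon''_{C_c}$, and applying Markov's inequality, we
get

$$\mathbb{P}_A\left[\left|\{c_1 \neq c_2 \in \Upsilon''_{C_c,\theta} : c_1, c_2 \text{ are not } T_\theta\text{-far}\}\right| \geq 1\right] = O\left(\frac{\log^3 n}{n}\right) + \frac{1}{\mathrm{poly}(n)}.$$

This gives the second part.

For the third part, consider a $T_\theta$-quasicherry $(a,b)$. Thinking of $\lambda$ as assigning
leaf labels in $T_{\theta'}$ uniformly at random, the probability that $b$ is within evolutionary
distance $C_{st} \log\log n$ of $a$ in $T_{\theta'}$ is at most

$$\mathbb{P}_\lambda\left[(a,b) \text{ is not } T_{\theta'}\text{-stretched}\right] \leq \frac{2^{\lfloor C_{st} \log\log n / f \rfloor}}{n} = O\left(\frac{\log n}{n}\right),$$

where the numerator in the second expression is an upper bound on the number of
vertices at evolutionary distance $C_{st} \log\log n$ of $a$ in $T_{\theta'}$. Summing over all pairs
in $\Upsilon''_{C_c,\theta}$ and assuming the bound in Part 1 holds, the expected number of pairs in
$\Upsilon''_{C_c,\theta}$ that are not $T_{\theta'}$-stretched is $O(\log^2 n / n)$. By Markov's inequality,

$$\mathbb{P}_{\lambda,A}\left[\left|\{(a,b) \in \Upsilon''_{C_c,\theta} : (a,b) \text{ is not } T_{\theta'}\text{-stretched}\}\right| \geq 1\right] \leq O\left(\frac{\log^2 n}{n}\right) + \frac{1}{\mathrm{poly}(n)}.$$

This gives the third part. □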
3.3 Mixing



We use a mixing argument similar to [Mos03]. Let

$$Q_{\min} = \min_{x \neq y} Q_{xy},$$

which is positive by assumption. We think of $Q$ as acting as follows. From a state
$x$, we have two types of transitions to $y \neq x$:

(i) We jump to state $y$ at rate $Q_{\min} > 0$.

(ii) We jump to state $y$ at rate $Q_{xy} - Q_{\min} \geq 0$.

Note that a transition of type (i) does not depend on the starting state. Hence if
$\mathcal{P}$ is a path from $u$ to $v$ in $T_\theta$, $N = \theta$, and a transition of type (i) occurs along $\mathcal{P}$,
then $\sigma_u$ is independent of $\sigma_v$. The probability, conditioned on $N = \theta$, that no such
transition occurs is $e^{-d_\theta(u,v)(r-1)Q_{\min}}$.
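The decomposition above can be illustrated numerically. The following is a minimal sketch, not part of the original argument: the function names are hypothetical, and the path length is plugged directly into the exponent exactly as in the formula $e^{-d(r-1)Q_{\min}}$ stated in the text.

```python
import math


def split_rate_matrix(Q):
    """Split the off-diagonal rates of Q into a uniform part and a residual.

    Returns (q_min, residual), where q_min is the smallest off-diagonal rate
    (the type-(i) rate) and residual[x][y] = Q[x][y] - q_min >= 0 is the
    type-(ii) rate. Diagonal entries of the residual are set to 0.
    """
    r = len(Q)
    q_min = min(Q[x][y] for x in range(r) for y in range(r) if x != y)
    residual = [
        [(Q[x][y] - q_min) if x != y else 0.0 for y in range(r)]
        for x in range(r)
    ]
    return q_min, residual


def prob_no_uniform_jump(distance, r, q_min):
    """Probability that no type-(i) transition occurs along a path.

    Type-(i) jumps leave every state at total rate (r - 1) * q_min, so the
    chance of seeing none over `distance` is exp(-distance * (r - 1) * q_min).
    """
    return math.exp(-distance * (r - 1) * q_min)
```

For instance, with three states and a smallest off-diagonal rate of 0.2, a path of length 2 avoids every type-(i) transition with probability $e^{-0.8}$; the bound decays exponentially in the path length, which is what makes long paths decouple their endpoints.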

Let $\Upsilon'' \subseteq [n]^2_{\neq}$ be a properly sparse set. We show next that pairs in $\Upsilon''$ are
independent with high probability. We proceed by considering the paths joining
them and arguing that transitions of type (i) are likely to occur on them by the
combinatorial properties in Definition 3.4. Formally, fix $\theta \in [\Theta]$ and consider two
pairs $c_1 = (a_1, b_1) \neq c_2 = (a_2, b_2) \in \Upsilon''$. By Definition 3.4, $c_1$ and $c_2$ are $T_\theta$-far.
There are three cases without loss of generality:

1. Ci, c 2 are Tg-quasicherries. In the subtree of T e connecting {ai, b±, a 2 , b 2 }, 
called a quartet, the paths Pa,ih Te (a 1: b 1 ) and Path Te (a 2 , b 2 ) are disjoint. 
This is denoted by the quartet split aibi\a 2 b 2 . Let V 6 [c\, c 2 ] be the internal 
path of the quartet. Note that by Definition 3.4 the length of V 9 [ci,c 2 ] is 
at least C/ log log n — 2C C . Denote by Pfjci,c 2 ] the subpath of V e [c\,c 2 \ 
within evolutionary distance \Cf log log n of c\. 

2. c\ is a Tg-quasicherry and c 2 is Tg-stretched. Consider the subtree of Tg 
connecting {ai, b±, a 2 }, called a triplet, let u be the central vertex of it. Let 
V e [cx, a 2 ] be the path connecting u and a 2 . Note that by Definition 3.4 the 
length of V e [ci , a 2 ] is at least C / log log n — C c . Denote by V e Cl [ci , a 2 ] the 
subpath of V e [ci, a 2 ] within evolutionary distance |C/ log log n of ci. Sim- 
ilarly, denote by Pf 2 [ci,a 2 ] the subpath of P e [ci,a 2 ] within evolutionary 
distance |C/ log logn of a 2 . 

3. Ci,c 2 are Tg- stretched. Let 'P 6, [ai,a 2 ] be the path connecting ai and a 2 . 
Note that by Definition 3.4 the length of V e [ai, a 2 ] is at least Cj log log n. 
Denote by V e ai [a-y , a 2 ] the subpath of V e [ai , a 2 \ within evolutionary distance
|C/ log log n of ai. Similarly, let P [ai, 61] be the path joining a\ and b\ 
and let V e ai [a± : b\] be the subpath of V e [ai, b±] within evolutionary distance 

\C st \og\ogn> |C/loglognof a ± . 

Condition on N = 9. For each c\ = (a±,bi) G Tg, let £ e Ci be the event: 

Each subpath V e ci [ci, c 2 ], c 2 7^ Ci G T' e ', and each subpath T 3 ^ [ci, a 2 ], 
c 2 = (02, ^2) G T" — Tg, undergo a transition of type (i) during the 
generation of sample a x . 

Similarly, for each c x = (a 1} b x ) G T" - T" e , let £ e ci = £ e ai n £j where 5* is the 
event (and similarly for ££): 

Each subpath T'fJ^, ai], c 2 G T^, each subpath P^ i [a 1 ,a 2 ], c 2 = 
(02,62) G T" — T' e ' with ci 7^ c 2 , as well as subpath T^Ja^&i] un- 
dergo a transition of type (i) during the generation of sample a x . 

Note that, under $\mathcal{E}^\theta_{c_1}$, the random variable $\sigma_{a_1}\sigma_{b_1}$ is independent of every other
such random variable in $\Upsilon''$. Moreover, in the case $c_1 \in \Upsilon'' - \Upsilon''_\theta$, then further
$\sigma_{a_1}$ is independent of $\sigma_{b_1}$. The next lemma shows that most of the events above
occur with high probability, implying that a large fraction of the $\sigma_{a_1}\sigma_{b_1}$'s are mutually
independent.

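Lemma 3.6 (Pair independence). Let $\Upsilon'' \subseteq [n]^2_{\neq}$ be a properly sparse set. Conditioned
on $N = \theta$, let

$$\mathcal{I} = \{c_1 \in \Upsilon'' : \mathcal{E}^\theta_{c_1} \text{ holds}\}.$$

For any $0 < \varepsilon_{\mathcal{I}} < 1$ and $C_{\mathcal{I}} > 0$, there exist $C_f$, $C_{st} > C_f$ and $C^-_{sp} > 0$ large
enough so that the following holds except with probability $n^{-C_{\mathcal{I}}}$ under $\mathbb{P}_l$,

$$|\mathcal{I}| \geq (1 - \varepsilon_{\mathcal{I}})|\Upsilon''|.$$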

Proof. Condition on $N = \theta$. Note that the $\mathcal{E}^\theta_{c_1}$'s are mutually independent because
the corresponding paths are disjoint by construction. By a union bound over $\Upsilon''$,
for all $c_1 \in \Upsilon''$,

Pi [(S e Cl ) C \N = 9] < 2Ct p logn-e-^ c f loglogn - 2Cc){r - 1)Q ^ 



poly(logra)' 
for $C_f$ large enough. Applying Lemma A.2 with

M = \r"\-p l [(eZ i y\N = e], 

and 5 + > 2e such that 

(l + 5 + )M = e x \V'\ > £x C-logn, 

we get 

P,[|T"-X| >£x|T // |] <2^ T "I 



n c x > 

by taking C~ p large enough in Definition 3.4. □ 

We use the independence claims above to simplify expectation computations. 

Lemma 3.7 (Expectation computations). Let Y" C [n]^ be a properly sparse set. 
The following hold: For all 9 ^ 9' e [©], 

7. V(a,6)eT; 

q e (a,b) > e~ Cc . 

2. V(o, 6) G T" - T£ 

ge(a,6) 



poly (log n) 

3. V( fl ,i)eT; 

q(a,b) = u e q e (a,b) 



poly(logn) 
4. Vc 1 = (a 1 ,6 1 )^c 2 = (a 2 ,6 2 )eTg, 

r(ci,c 2 ) = u e (l - u e )q e (a 1 ,b 1 )q e (a 2 ,b 2 ) 



poly(logn) 



> ^(1 - ^)e~ 2Cc > 0. 
5. Vci = (ai,&i) G Y£,c 2 = (a 2 ,&2) e Tg,, 

r(ci,c 2 ) = -^(01,61)1/^(02,62) + 
< --z/e" 2Cc < 0. 
poly(logn)



Proof. Parts 1 and 2 follow from the fact that $q_\theta(a, b) = e^{-d_\theta(a,b)}$, $d_\theta(a, b) \leq C_c$
for all $(a, b) \in \Upsilon''_\theta$ and $d_\theta(a, b) \geq C_{st} \log\log n$ for all $(a, b) \in \Upsilon'' - \Upsilon''_\theta$ from
Definition 3.4. Part 3 follows from Parts 1 and 2.

For Part 4, let c L = (a 1 , &i) ^ c 2 = (a 2 , b 2 ) G T^. Note that 

^[^^^^1^ = 0,4,4] = E l [a ai a bl \N = e]E l [a aa a ba \N = 6] 

= qe(a 1 ,b 1 )q e (a 2 ,b 2 ), 

and 

E,[ < 7 Ol <7 6l <7 aa <T 62 |JV = , ,5j,^] = ^K|iV = ^E,K|iV = ^ 

xE,[<7 aa |JV = 0']E,[<7 62 |JV = 0'] 

= 0, 

by (2), so that 

Ez[<7 ai <7 6l <7 a2 <7 62 ] = veq$(ai, h)qe(a 2 , b 2 ) + 



poly(logn) ' 

from (10). Then Part 4 follows from Lemma 3.2 and Part 3. 

For Part 5, let a = (aiA) e T^ca = (a 2 ,6 2 ) e Tg,. Let 0" ^ 0,0'. Note 

that 

E l [a ai a bl a a2 a b2 \N = 9,£ d Cl ,S e C2 } = E^a^N = 9] 

xE l [a a2 \N = e]E l [a ba \N = e] 

= o, 

and 

Efca^Ga^N = 9', = Ei[a ai a bl \N = 9'] 

xEfca^N = e']E l [a b2 \N = 9'] 

= 0. 

Moreover, since c x ,c 2 £ T^,,, 

E l [a ai a bl a a2( r b2 \N = 9",S e c ';,S^] = EfcjN = 0"]E,[<7 6l |JV = 9"} 

xE^N = ^[a^N = 9"} 

= 0. 

Hence 

Ei[o- ai o- 6l o- a2 o- te ] = H — r, 

poly(logn) 

from (10). Then Part 5 follows from Lemma 3.2 and Part 3. 

□ 
3.4 Large-tree concentration

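Let $\Upsilon'' \subseteq [n]^2_{\neq}$ be a properly sparse set. Consider the clustering statistic

$$U_\theta = \frac{1}{|\Upsilon''_\theta|} \sum_{(a,b) \in \Upsilon''_\theta} \sigma_a \sigma_b.$$

We show that $U_\theta$ is concentrated and separates the $\theta$-component from all other
components.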

Lemma 3.8 (Separation). There exists $C_\Delta > 0$ such that for $\theta' \neq \theta$

$$\mathbb{E}_l[U_\theta \mid N = \theta] > C_\Delta,$$

and

$$\mathbb{E}_l[U_\theta \mid N = \theta'] < C_\Delta.$$
Proof. By Definition 3.4, all (a, b) e T'g are Tq> -stretched. Hence 

E l [U e \N = 6] > e~ c % 

and 

E l [U e \N = 9'] = —^ s , 

poly (log n) 

by Lemma 3.7. Taking Ca = |e _Cc gives the result. □ 

Lemma 3.9 (Concentration of Ue). For all eu > and Cu > 0, there are C/ > 
C st > Cf and C~ p > large enough such that for all 9, 9' (possibly equal) 

[\U e > - E^N = 9}\ > e u \N = 0] < -L 

Proof. Let X be as in Lemma 3.6 and let Uq be the same as Ue with the sum 
restricted to X. From Lemmas 3.5 and 3.6, conditioned on X, Uq is a normalized 
sum of 0(logn) independent bounded variables. Concentration of Uq therefore 
follows from Lemma A.l using m = fi(logn), t = 0(l/logn) and ( = \eu- 
Taking e% = \tu maxj zf and C% > Cu in Lemma 3.6 as well as C~ p > large 
enough gives the result. □ 
4 Constructing the clustering statistic from data



In this section, we provide details on the plan laid out in Section 2.3. 

Fix a GTR matrix $Q$ and constants $\Theta \geq 2$, $0 < f < g < +\infty$ and $\underline{\nu} > 0$.
Let $\Lambda$ be a permutation-invariant probability measure on $\mathrm{GM}_\Theta[f, g, \underline{\nu}, n]$. In this
section, we work directly with samples $\{\sigma^i_X\}_{i=1}^k$ generated from an unknown $\Theta$-GM
model $(T, \nu)$ picked according to $\Lambda$.

Our goal is to construct the clustering statistics $\{U^i_\theta\}_{i=1}^k$ from $\{\sigma^i_X\}_{i=1}^k$. These
statistics will be used in the next section to reconstruct the topologies of the model
$(T, \nu)$.

4.1 Clustering algorithm 

We proceed in three steps. Let 
and 

u = 2 -ve-^. 

The algorithm is the following: 

1. (Finding quasicherries) For all pairs of leaves $a, b \in [n]$, compute $\hat{q}(a, b)$
and set

$$\Upsilon' = \{(a, b) \in [n]^2_{\neq} : \hat{q}(a, b) \geq \omega\}.$$

2. (Sparsification) Construct $\Upsilon''$ by keeping each $(a, b) \in \Upsilon'$ independently
with probability

$$p_{sp} = \frac{C_{sp} \log n}{n}.$$

3. (Inferring clusters) For all $c_1 \neq c_2 \in \Upsilon''$ compute $\hat{r}(c_1, c_2)$ and set $c_1 \sim c_2$
if

$$\hat{r}(c_1, c_2) > 0.$$

Let $\Upsilon''_\theta$, $\theta = 1, \ldots, \Theta$, be the equivalence classes of the transitive closure of $\sim$.

4. (Final) Return $\Upsilon''_\theta$, $\theta \in [\Theta]$.
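The four steps above can be sketched in code. This is only an illustration under stated assumptions, not the authors' implementation: each site is given as a list of real leaf values, the pair statistic is taken to be the empirical mean of the product of the two leaf values, the pair-of-pairs statistic is taken to be the empirical covariance of two such products, and pairs are treated as unordered. The names `cluster_pairs`, `omega` and `c_sp` are hypothetical.

```python
import math
import random
from itertools import combinations


def cluster_pairs(sigma, omega, c_sp, rng=None):
    """Cluster leaf pairs from k sites; sigma[i][a] is the value at leaf a in site i."""
    rng = rng or random.Random(0)
    k, n = len(sigma), len(sigma[0])

    def q_hat(a, b):
        # Assumed estimator: empirical mean of sigma_a * sigma_b.
        return sum(s[a] * s[b] for s in sigma) / k

    def r_hat(c1, c2):
        # Assumed estimator: empirical covariance of the two pair products.
        (a1, b1), (a2, b2) = c1, c2
        joint = sum(s[a1] * s[b1] * s[a2] * s[b2] for s in sigma) / k
        return joint - q_hat(a1, b1) * q_hat(a2, b2)

    # Step 1: keep pairs whose estimated correlation clears the threshold.
    candidates = [c for c in combinations(range(n), 2) if q_hat(*c) >= omega]

    # Step 2: sparsify, keeping each pair independently with probability p_sp.
    p_sp = min(1.0, c_sp * math.log(n) / n)
    kept = [c for c in candidates if rng.random() < p_sp]

    # Step 3: transitive closure of "r_hat > 0" via union-find.
    parent = {c: c for c in kept}

    def find(c):
        while parent[c] != c:
            parent[c] = parent[parent[c]]
            c = parent[c]
        return c

    for c1, c2 in combinations(kept, 2):
        if r_hat(c1, c2) > 0:
            parent[find(c1)] = find(c2)

    # Step 4: return the equivalence classes.
    clusters = {}
    for c in kept:
        clusters.setdefault(find(c), []).append(c)
    return list(clusters.values())
```

Union-find is used here only as a convenient way to compute the transitive closure; any connected-components routine on the graph with edges given by positive estimated covariance would do the same job.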
4.2 Analysis of the clustering algorithm

We show that each step of the previous algorithm succeeds with high probability. 

Lemma 4.1 (Finding quasicherries). The set $\Upsilon'$ satisfies the following, except with
probability at most $\exp(-\mathrm{poly}(n))$ under $\mathbb{P}_l$:

$$\Upsilon_{4g} \subseteq \Upsilon' \subseteq \Upsilon_{C_c}.$$

Proof We prove both inclusions. For all 9 G [6] and (a, b) G T^, 

qe(a,b)>e~ 49 , 

and 

g(a, 6) > ve-* 9 > 2 -ve^ 9 = u. 

By Lemma 3.2, 

g(a, b) > u>, 

except with probability exp(— poly(ra)). 

Similarly for any (a, b) G T', by Lemma 3.2, if 

2 

q(a,b) > id = -ue 9 , 

then 

q(a,b) >\ve- 49 , 



so that there is 6 G [01 with 



That is, 



and 



v e qe(a,b) > ^ue 49 



ge(a ' 6) -3eTT^)- e " 43 ' 



Ma,b)<-ln{^— f e^)^C c . 

Hence (a, b) G T Cc ,e- □ 

Lemma 4.2 (Sparsification). Assuming that the conclusion of Lemma 4.1 holds,
$\Upsilon''$ is properly sparse, except with probability $1/\mathrm{poly}(n)$.

Proof. This follows from Lemma 4.1 and the choice of $p_{sp}$. □

Lemma 4.3 (Inferring clusters). Assuming that the conclusions of Lemmas 4.1 
and 4.2 hold, we have = and there is a bijective mapping h of [0] such that 

L h(0) — L 0i 

with the choice T' = T' in Section 3.2, except with probability exp(— poly(n)). 

Proof. It follows from Lemmas 3.3 and 3.7 that $\sim$ is an equivalence relation with
equivalence classes $\Upsilon''_\theta$, $\theta = 1, \ldots, \Theta$, except with probability $\exp(-\mathrm{poly}(n))$.

□ 



5 Tree reconstruction 

We now show how to use the clustering statistics to build the topologies. The 
algorithm is composed of two steps: we first bin the sites according to the value 
of the clustering statistics; we then use the sites in one of those bins and apply a 
standard distance-based reconstruction method. We show that the content of the 
bins is made of sites from the same component — thus reducing the situation to the 
unmixed case. 
Let 

1 -c 

m = rf C % 

and 

1 2 

£x = t^ £ u max z i . 

Moreover take Cf, C st , C v , and C~ so that the lemmas in Section 3 hold. 
To simplify notation, we rename the components so that h is the identity. 

5.1 Site binning 

Let $\Upsilon''_\theta$, $\theta \in [\Theta]$, be the sets returned by the algorithm in Section 4. Assume
that the conclusions of Lemmas 4.1, 4.2, and 4.3 hold. We bin the sites with the
following procedure.

1. (Clustering statistics) For all $i = 1, \ldots, k$ and all $\theta = 1, \ldots, \Theta$, compute

$$U^i_\theta = \frac{1}{|\Upsilon''_\theta|} \sum_{(a,b) \in \Upsilon''_\theta} \sigma^i_a \sigma^i_b.$$

2. (Binning sites) For all $\theta = 1, \ldots, \Theta$, set

$$\hat{\mathcal{K}}_\theta = \{i \in [k] : U^i_\theta > C_\Delta\}.$$
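As an illustration of the binning procedure, here is a minimal sketch. The function name `bin_sites` and the argument `threshold` (standing in for the separation constant of Lemma 3.8) are hypothetical, and the clusters are assumed to be given as lists of leaf-index pairs.

```python
def bin_sites(sigma, clusters, threshold):
    """Assign sites to bins by thresholding the per-site clustering statistic.

    sigma[i][a] is the value at leaf a in site i; clusters[t] is the list of
    leaf pairs returned for component t. Bin t collects every site whose
    average pair product over clusters[t] exceeds the threshold.
    """
    bins = []
    for pairs in clusters:
        members = []
        for i, site in enumerate(sigma):
            # Per-site clustering statistic: average of sigma_a * sigma_b.
            u = sum(site[a] * site[b] for a, b in pairs) / len(pairs)
            if u > threshold:
                members.append(i)
        bins.append(members)
    return bins
```

Nothing in this sketch forces a site into exactly one bin; that each site lands in the bin of its own component, with high probability, is what the analysis has to establish.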

We show that the binning is successful with high probability. 

Lemma 5.1 (Binning the sites). Assume that the conclusions of Lemmas 4.1, 4.2,
and 4.3 hold. For any $C_k$, there exists $C_U$ large enough so that, for all $\theta \in [\Theta]$,

$$\hat{\mathcal{K}}_\theta = \mathcal{K}_\theta,$$

except with probability $1/\mathrm{poly}(n)$.

Proof. This follows from Lemmas 3.8 and 3.9 by a union bound over all samples.

□ 

5.2 Estimating a distorted metric 

Estimating evolutionary distances We estimate evolutionary distances on each 
component. For all 9 G [0], let fC e be as above and assume the conclusions of 
Lemma 5.1 hold. 

1. (Estimating distances) For all 9 = 1, . . . , and a ^ b e [n], compute 

Lemma 5.2 (Estimating distances). Assume the conclusions of Lemma 5.1 hold. 
The following hold except with probability exp(— poly (n)): For all 9 e [0] and 
all a 7^ b G [n], 

\qo(a,b) - q (a,b)\ < 
Proof. The result follows from Lemma 3.2. □ 
Tree construction To reconstruct the tree, we use a distance-based method
of [DMR09]. We require the following definition. 

Definition 5.1 (Distorted metric [Mos07, KZZ03]). Let $T = (V, E; \phi; w)$ be a
phylogeny with corresponding tree metric $d$ and let $\tau, \Psi > 0$. We say that $\hat{d} :
X \times X \to (0, +\infty]$ is a $(\tau, \Psi)$-distorted metric for $T$ or a $(\tau, \Psi)$-distortion of $d$ if:

1. [Symmetry] For all $a, b \in X$, $\hat{d}$ is symmetric, that is,

$$\hat{d}(a, b) = \hat{d}(b, a);$$

2. [Distortion] $\hat{d}$ is accurate on "short" distances, that is, for all $a, b \in X$, if
either $d(a, b) < \Psi + \tau$ or $\hat{d}(a, b) < \Psi + \tau$ then

$$\left|d(a, b) - \hat{d}(a, b)\right| < \tau.$$
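The two conditions of Definition 5.1 translate directly into a check on a pair of distance matrices. The sketch below is illustrative only: the function name `is_distortion` is hypothetical, leaves are indexed by integers, and the diagonal is skipped as a simplifying assumption.

```python
import math


def is_distortion(d, d_hat, tau, psi):
    """Check whether d_hat is a (tau, psi)-distortion of the metric d.

    d and d_hat are square matrices (lists of lists); entries of d_hat may be
    math.inf. Diagonal entries are ignored (an assumption of this sketch).
    """
    n = len(d)
    for a in range(n):
        for b in range(n):
            if a == b:
                continue
            # Condition 1: symmetry of the estimate.
            if d_hat[a][b] != d_hat[b][a]:
                return False
            # Condition 2: accuracy whenever either distance is "short".
            if d[a][b] < psi + tau or d_hat[a][b] < psi + tau:
                if not abs(d[a][b] - d_hat[a][b]) < tau:
                    return False
    return True
```

Note that an infinite estimate is acceptable for a pair whose true distance is long, but fails the check as soon as the true distance falls below $\Psi + \tau$.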



An immediate consequence of [DMR09, Theorem 1] is the following.

Claim (Reconstruction from distorted metrics [DMR09]). Let $T = (V, E; \phi; w)$
be a phylogeny in $\mathbb{Y}_{f,g}$. Then the topology of $T$ can be recovered in polynomial time
from a $(\tau, \Psi)$-distortion $\hat{d}$ of $d$ as long as

$$\tau \leq \frac{f}{5},$$

and

$$\Psi \geq 5g \log n.$$

Remark 5.1. The constants above are not optimal but will suffice for our pur- 
poses. 

See [DMR09] for the details of the reconstruction algorithm. 
We now show how to obtain a $(f/5, 5g \log n)$-distortion with high probability
for each component.

Lemma 5.3 (Distortion estimation). There exist $C_q, C_k > 0$ so that, given that the
conclusions of Lemma 5.2 hold, for all $\theta \in [\Theta]$

$$\hat{d}_\theta(a, b) = -\ln\left(\hat{q}_\theta(a, b)_+\right), \quad (a, b) \in X \times X,$$

is a $(f/5, 5g \log n)$-distortion of $d_\theta$.
Proof. Fix $\theta \in [\Theta]$. Define

L 2 = {(a, b) G X x X : dg(a,b) < 15g log n}, 

and 

L+ = {(a, 6) G X x X : cfo(a,6) > 12 g log n}, 
Let (a, &) € L 2 . Note that 

1 

n c 'i 



-de(a,b) > eX p(-15#logn) _ f „ . 



where the last equality is a definition. Then, taking C q (and hence C k ) large 
enough, from Lemma 5.2 we have 



d e (a,b) - dg(a,b) 
Similarly, let (a, b) G L%. Note that 



~ 5 



-de(a,b) < exp (-12^1ogn) = 



C" ' 

n « 

where the last equality is a definition. Then, taking C 9 large enough, from Lemma 5.2 
we have 

f 

dg(a,b) > 5g\ogn + -. 

o 

□ 
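The distance estimate of Lemma 5.3 is a logarithmic transform of an estimated correlation. A minimal sketch follows; it assumes that the estimate is the negative logarithm of the positive part of the correlation estimate, so that a non-positive estimate maps to an infinite distance. The function name is hypothetical.

```python
import math


def estimate_distance(q_hat):
    """Return -ln of the positive part of q_hat (math.inf when q_hat <= 0)."""
    return -math.log(q_hat) if q_hat > 0 else math.inf
```

Mapping non-positive estimates to infinity is compatible with a distorted metric, which is only required to be accurate on short distances.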



6 Proof of main theorems 

We are now ready to prove the main theorems.

Proof. (Theorem 3) Let $C_1, C_2 > 0$. Let $A_n$ be the subset of those $\Theta$-GM models
$(T, \nu)$ in $\mathrm{GM}_\Theta[f, g, \underline{\nu}, n]$ for which Part 3 of Lemma 3.5 holds with probability
at least $1 - n^{-C_1}$ under the random choices of the algorithm. By the proof of
Lemma 3.5, for small enough $C_1, C_2 > 0$, we have $\Lambda_n[A_n^c] \leq n^{-C_2}$. On $A_n$, the
lemmas in Sections 3, 4 and 5 hold with probability $1 - 1/\mathrm{poly}(n)$. Then the
topologies are correctly reconstructed by the Claim in Section 5.2. □
Proof. (Theorem 1) Let

$$(T, \nu) \neq (T', \nu') \in \bigcup_{n \geq 1} A_n.$$

Then, by Theorem 3, the algorithm correctly reconstructs the topologies in $(T, \nu)$
with probability $1 - 1/\mathrm{poly}(n)$ on sequences of length $k = \mathrm{poly}(n)$. Repeating
the reconstruction on independent sequences and taking a majority vote, we get
almost sure convergence to the correct topologies. The same holds for $(T', \nu')$.
Hence,

$$\mathcal{D}_l[(T, \nu), Q] \neq \mathcal{D}_l[(T', \nu'), Q].$$

□ 

Proof. (Theorem 2) Let

$$(T, \nu) \in \bigcup_{n \geq 1} A_n,$$

with $\Theta = 2$ and $\nu = (1/2, 1/2)$. Then, from the proof of Lemma 5.1, there exists
a clustering statistic such that samples from $T_1$ and $T_2$ are correctly distinguished
with probability $1 - 1/\mathrm{poly}(n)$. Recall that

$$\|\mathcal{D} - \mathcal{D}'\|_{TV} = \sup_B |\mathcal{D}(B) - \mathcal{D}'(B)|.$$

Taking $B$ to be the event that a site is recognized as belonging to component 1 by
the clustering statistic above, we get

$$\|\mathcal{D}_l[T_1, Q] - \mathcal{D}_l[T_2, Q]\|_{TV} = 1 - o_n(1).$$

□ 
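The step from a good classifier to a large total variation distance can be checked on a toy example. The sketch below is illustrative only: distributions are finite dictionaries from outcomes to probabilities, and the function names and the numbers in the usage note are hypothetical.

```python
def tv_distance(p, q):
    """Total variation distance between two distributions on a finite set.

    On a finite set, the supremum over events of |P(B) - Q(B)| equals half
    the L1 distance between the probability vectors.
    """
    support = set(p) | set(q)
    return 0.5 * sum(abs(p.get(x, 0.0) - q.get(x, 0.0)) for x in support)


def event_gap(p, q, event):
    """|P(B) - Q(B)| for a single event B; a lower bound on the TV distance."""
    p_b = sum(p.get(x, 0.0) for x in event)
    q_b = sum(q.get(x, 0.0) for x in event)
    return abs(p_b - q_b)
```

If a site from the first tree is labelled "component 1" with probability 0.99 and a site from the second tree with probability 0.02, the single event "labelled component 1" already witnesses a gap of 0.97, so the total variation distance is at least that large.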

7 Concluding remarks 

Our techniques also admit the following extensions: 

• When Q is unknown, one can still apply our technique by using the fol- 
lowing idea. Note that all we need is an eigenvector of Q with negative 
eigenvalue. Choose a pair (a, b) of close leaves using, for instance, the clas- 
sical log-det distance [SS03]. Under a permutation-invariant measure, (a, b) 
is stretched in all but one component, with high probability. One can then 
compute an eigenvector decomposition of the transition matrix between a 
and b. We leave out the details. 
• The minimum frequency assumption is not necessary as long as one has an
upper bound on the number of components and requires only that
frequent enough components be detected and reconstructed. We leave out
the details.
A Useful lemmas

Recall the following standard concentration inequalities (see e.g. [MR95]): 

Lemma A.1 (Azuma-Hoeffding Inequality). Suppose $Z = (Z_1, \ldots, Z_m)$ are independent
random variables taking values in a set $S$, and $h : S^m \to \mathbb{R}$ is any
$t$-Lipschitz function: $|h(z) - h(z')| \leq t$ whenever $z, z' \in S^m$ differ at just one
coordinate. Then, $\forall \zeta > 0$,



F[\h(Z)-E[h(Z)}\ >(]< 2exp 
Lemma A.2 (Chernoff bounds). Let $Z_1, \ldots, Z_m$ be independent Poisson trials
such that, for $1 \leq i \leq m$, $\mathbb{P}[Z_i = 1] = p_i$ where $0 < p_i < 1$. Then, for
$Z = \sum_{i=1}^m Z_i$, $M = \mathbb{E}[Z] = \sum_{i=1}^m p_i$, $0 < \delta_- \leq 1$, and $\delta_+ > 2e - 1$,

$$\mathbb{P}[Z < (1 - \delta_-)M] < e^{-M\delta_-^2/2},$$

and

$$\mathbb{P}[Z > (1 + \delta_+)M] < 2^{-(1 + \delta_+)M}.$$
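To see how the Chernoff bounds of Lemma A.2 yield polynomially small failure probabilities when the mean is of order $\log n$, the two tail bounds can be evaluated directly. This is a small numerical sketch; the function names are hypothetical.

```python
import math


def chernoff_lower_tail(mean, delta_minus):
    """Bound on P[Z < (1 - delta_minus) * mean], valid for 0 < delta_minus <= 1."""
    if not 0 < delta_minus <= 1:
        raise ValueError("delta_minus must lie in (0, 1]")
    return math.exp(-mean * delta_minus ** 2 / 2)


def chernoff_upper_tail(mean, delta_plus):
    """Bound on P[Z > (1 + delta_plus) * mean], valid for delta_plus > 2e - 1."""
    if not delta_plus > 2 * math.e - 1:
        raise ValueError("delta_plus must exceed 2e - 1")
    return 2.0 ** (-(1 + delta_plus) * mean)
```

For example, with mean $8 \ln n$ and $\delta_- = 1/2$ the lower-tail bound equals $1/n$, and increasing the constant in front of $\ln n$ gives any desired inverse polynomial. The choice $\delta_+ = 5$ satisfies the constraint $\delta_+ > 2e - 1 \approx 4.44$.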